Abstract

The goal of the project is to extract content from tables
in document images based on learnt patterns. Real-world
users, i.e., clients, first provide a set of key fields
within the table that they consider important. These
fields are used to build a graph in which nodes are
labelled with semantics and other features, and edges are
attributed with relations. The attributed relational graph
(ARG) is then employed to mine similar graphs from a
document image. Each mined graph represents an item within
the table, and hence a set of such graphs composes a table.
We have validated the concept on a real-world industrial
problem.

1 Introduction 

In document analysis and processing, table extraction from
document images has received considerable attention since
tables contain key information. In the context of table
extraction [1-4], document image analysis and processing
describes a table in terms of lines and (un)analysed text
blocks, a set of cells resembling a two-dimensional grid,
or a set of strings that are linked to each other via
relations, for instance.

Table detection and table structure recognition are the
two major tasks. Table detection can be taken as the
primary issue, but it does not provide a complete solution
[5] since one also needs to extract the key fields within
the table. Existing methods such as table segmentation [6]
do not extract key fields, nor do they explicitly perform
content understanding [7]. Note that structural
information, for instance relations between the contents,
can be very useful in indexing and retrieving document
information [2]. To analyse table-form structure,
ruling-based techniques are limited without a priori
knowledge about table organisation [1]. Such techniques
fail since not all tables possess graphical lines.
Besides, plain ASCII texts and text blocks are used.
Detecting columns, lines and headers, and representing
them in terms of a graph, for instance, is interesting
since the graph carries structural information. In order
to fully exploit tables in scanned documents rather than
just outlining their overall boundary, it is interesting
to extract those fields that are important or meaningful
for the clients. To handle this, in this paper, key fields
are provided by the clients. These key fields are then
used to build a graph so that it can be applied to table
extraction in the absence of clients.

The rest of the paper is organised as follows. We start by
explaining the proposed method in Section 2. Full
experiments are reported and analysed in Section 3. The
paper is concluded in Section 4.

Figure 1. Workflow showing two consecutive phases,
graph-based pattern representation and graph mining, to
handle table extraction.



2 Proposed method 

Generally speaking, a table is composed of similar items
(sometimes just a single one), even when column alignment
and the corresponding text flow (in either a single line
or multiple lines) are not guaranteed. Given an input
pattern (i.e., an item, for instance) from a client,
finding similar patterns in the document is the core part
of the paper. It not only extracts important fields (in
accordance with the client) but also configures the table,
which is represented by a set of similar patterns. To
handle this, we first represent an input pattern via an
ARG and then perform graph mining so that graphs that are
structurally and semantically similar can be extracted.
Fig. 1 illustrates the overall idea.

2.1 Graph-based pattern representation 

In any document $d$, the clients provide input pattern(s)
while showing interest in a particular type $t$ of table
in either the header, body or footer zone:
$\text{table}^t = \{\text{pattern}^t_n,\ n \in [1, N]\}$,
where $N$ can be arbitrary. An example of an input pattern
is shown in Fig. 2, i.e., it is just a collection of the
selected key fields: $\{\text{field}_i\}_{i=1}^{I}$. To
represent each field, we define a feature set $F$ as
$\{\text{feature}_f\}_{f=1}^{|F|}$. For any $i$-th field,
we can formally represent its features as

$$
\text{field}_i = \left\{
\begin{array}{ll}
(\text{box: [left, top, right, bottom]}); & (\text{wSep: words separation}); \\
(\text{value: content}); & (\text{noW: number of words}); \\
(\text{type: content type}); & (\text{noL: number of lines}); \\
(\text{size: string length}); & (\text{label: date and price, for instance})
\end{array}
\right\} \tag{1}
$$
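
The feature set of Eq. (1) can be pictured as a small record type. The following is a minimal sketch, not the authors' implementation: the class name `Field`, the attribute names, the coarse content types and the two regular expressions in `LABEL_PATTERNS` are all assumptions made for illustration, since the paper does not list its patterns.

```python
import re
from dataclasses import dataclass

# Hypothetical label patterns: the paper does not list its regular expressions.
LABEL_PATTERNS = {
    "date": re.compile(r"\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}"),
    "price": re.compile(r"\d+[.,]\d{2}"),
}


@dataclass
class Field:
    box: tuple[int, int, int, int]  # (left, top, right, bottom)
    value: str                      # OCR content of the field
    w_sep: float = 0.0              # wSep: separation between words
    no_l: int = 1                   # noL: number of lines

    @property
    def no_w(self) -> int:          # noW: number of words
        return len(self.value.split())

    @property
    def size(self) -> int:          # size: string length
        return len(self.value)

    @property
    def content_type(self) -> str:  # type: coarse content type (assumed classes)
        text = self.value.replace(" ", "")
        if text.isdigit():
            return "numeric"
        if text.isalpha():
            return "alphabetic"
        return "mixed"

    @property
    def label(self) -> str:         # label: semantic value via regular expressions
        for name, pattern in LABEL_PATTERNS.items():
            if pattern.fullmatch(self.value.strip()):
                return name
        return "text"               # fallback label, also an assumption


date_field = Field(box=(40, 120, 130, 138), value="12/03/2012")
price_field = Field(box=(400, 120, 460, 138), value="149,90")
```

Deriving `no_w`, `size` and `label` from `value` keeps the record consistent when OCR output is corrected; a real system would probably use more tolerant patterns to absorb OCR noise.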

The labels are derived from the features and represent
semantic values via regular expressions. Thanks to the
regular expressions, we are able to express a wide range
of string values even in the presence of OCR errors due to
broken characters or characters connected with graphics,
for instance. To exploit relative positioning between the
key fields, we use the bounding box and its projection
into $3 \times 3$ partitions [8] (defined in
$\mathbb{R}^2$, i.e., left, right, ...). For more
precision, we integrate the level of neighbourhood $k$
into the basic predefined set of spatial predicates, which
gives

$$
\text{spatial predicate}_{k_1, k_2}(\text{field}_i, \text{field}_j). \tag{2}
$$

Figure 2. An example of the input pattern and the
corresponding graph that includes missing fields.



Formally, $k = 0$ for an adjacent (i.e., immediate) field,
and $k$ varies from 1 to $A - 1$ for non-adjacent ones.
Note that $k_1$ and $k_2$ represent horizontal and
vertical orientations, respectively.
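
One possible reading of the spatial predicate with neighbourhood levels can be sketched as follows. This is an illustration under stated assumptions, not the method of [8]: the direction is taken from the position of the second field's centre in the 3x3 partition induced by the first field's bounding box, and the level $k$ is approximated by counting the fields whose centre lies strictly between the two along one axis. All function names are hypothetical.

```python
def centre(box, axis):
    """Centre of a (left, top, right, bottom) box along axis 0 (x) or 1 (y)."""
    return (box[axis] + box[axis + 2]) / 2


def direction(ref, other):
    """Locate the centre of `other` in the 3x3 partition induced by `ref`.

    Image coordinates are assumed: y grows downwards, so "top" is smaller y.
    """
    left, top, right, bottom = ref
    cx, cy = centre(other, 0), centre(other, 1)
    horizontal = "left" if cx < left else "right" if cx > right else "middle"
    vertical = "top" if cy < top else "bottom" if cy > bottom else "middle"
    return horizontal, vertical


def neighbourhood_level(ref, other, boxes, axis):
    """Count fields whose centre lies strictly between `ref` and `other`
    along one axis; 0 means the two fields are adjacent on that axis."""
    lo, hi = sorted((centre(ref, axis), centre(other, axis)))
    return sum(
        1 for b in boxes
        if b != ref and b != other and lo < centre(b, axis) < hi
    )


def spatial_predicate(ref, other, boxes):
    """Return ((horizontal, k1), (vertical, k2)) for the pair (ref, other)."""
    horizontal, vertical = direction(ref, other)
    k1 = neighbourhood_level(ref, other, boxes, axis=0)
    k2 = neighbourhood_level(ref, other, boxes, axis=1)
    return (horizontal, k1), (vertical, k2)


# Three fields on one text line: a date, a reference and a price.
date_box = (40, 100, 120, 120)
ref_box = (150, 100, 260, 120)
price_box = (300, 100, 380, 120)
boxes = [date_box, ref_box, price_box]
```

Here `spatial_predicate(date_box, price_box, boxes)` places the price to the right of the date with one field in between horizontally. The counting rule ignores whether intervening fields overlap on the other axis, so it is only a crude proxy for the neighbourhood level.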

Now, we introduce a 4-tuple ARG

$$G = (V, E, F_V, F_E),$$

where

• $V$ is a finite set of nodes (fields);

• $E \subseteq V \times V$ is a finite set of edges, and
each $r_{ij} \in E$ is a pair $(v_i, v_j)$ where
$v_i, v_j \in V$;

• $F_V : V \rightarrow L_V$, where $L_V$ represents a set
of nodes as well as their labels $\ell$; and

• $F_E : E \rightarrow R_E$, where $R_E$ represents the
edges via relations.

To make the graph complete, we also include non-selected
fields, which are mainly missing and neighbouring fields.
To know how many words can be taken for a single field, we
simply use intra-field knowledge (i.e., the maximum
distance between the words in a single field) from the
selected key fields.
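
As an illustration of the 4-tuple, the ARG can be held in two dictionaries that play the roles of the node and edge attribute functions. This is a minimal sketch with hypothetical names (`ARG`, `add_node`, `add_edge`); the relation encoding as a `(direction, k)` pair is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ARG:
    """Attributed relational graph G = (V, E, F_V, F_E)."""
    node_labels: dict = field(default_factory=dict)     # F_V: node -> label
    edge_relations: dict = field(default_factory=dict)  # F_E: (v_i, v_j) -> relation

    def add_node(self, v, label):
        self.node_labels[v] = label

    def add_edge(self, vi, vj, relation):
        # E is a subset of V x V, so both endpoints must already be nodes.
        if vi not in self.node_labels or vj not in self.node_labels:
            raise KeyError("both endpoints must be nodes of the graph")
        self.edge_relations[(vi, vj)] = relation

    @property
    def V(self):
        return set(self.node_labels)

    @property
    def E(self):
        return set(self.edge_relations)


# A tiny pattern graph: a date field with a price field to its right.
Q = ARG()
Q.add_node(1, "date")
Q.add_node(2, "price")
Q.add_edge(1, 2, ("right", 0))
```

Keeping $V$ and $E$ as views over the attribute dictionaries guarantees that every node has a label and every edge has a relation.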

2.2 Content extraction via graph mining 

Given the pattern graph $Q$, extracting similar graphs
from a document starts with pivotal node selection in the
document, followed by relation assignment and the
computation of feature scores between pairs of nodes.
Relation assignment repeats until a graph $G$ similar to
$Q$ is achieved.

Pivotal nodes selection. Given a predefined set
$\mathcal{L}$ of labels in the domain, such as price,
date, address and description, for every node $v^q_i$ in
pattern graph $Q$, the corresponding label
$\ell^q_i \in \mathcal{L}$ is defined, i.e.,
$V^q = \{(v^q_i, \ell^q_i),\ i = 1, \ldots, |V^q|\}$.
Having these labelled nodes $\{(v^q_i, \ell^q_i)\}$ in
pattern graph $Q$, the target is to select nodes sharing
identical labels $\{(v_i, \ell_i)\}$ from a document $d$.
We refer to the selected nodes as pivotal nodes.
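
Pivotal node selection amounts to a label lookup. The sketch below assumes that both the pattern graph and the document are available as plain dictionaries mapping node identifiers to labels; the function name and the identifiers are hypothetical.

```python
def select_pivotal_nodes(pattern_labels, document_labels):
    """For each pattern node, list the document nodes carrying the same label."""
    pivots = {}
    for vq, label in pattern_labels.items():
        pivots[vq] = [v for v, lab in document_labels.items() if lab == label]
    return pivots


pattern_labels = {"q1": "date", "q2": "price"}
document_labels = {1: "date", 2: "text", 3: "price", 4: "price"}
pivots = select_pivotal_nodes(pattern_labels, document_labels)
```

In this toy document, node 1 becomes the pivot for the date node of the pattern and nodes 3 and 4 become candidate pivots for the price node; node 2 is ignored because its label does not occur in the pattern.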

Feature score computation. Each pivotal node is taken in
turn, and its relations with neighbouring nodes in the
document are validated against those in the pattern graph.
To compute the feature score between a pair of nodes
$(v_i, v_j)$ in a document with respect to
$(v^q_i, v^q_j) \in Q$, their respective relations must be
identical, i.e., $r_{ij}$ validates with $r^q_{ij}$. More
formally, we can compute the feature score between two
corresponding nodes $v^q$ and $v$ as

$$
f.\text{score}(v^q, v) =
\begin{cases}
1 & \text{if label in } v^q = \text{label in } v, \text{ and} \\[4pt]
\dfrac{1}{|F|} \displaystyle\sum_{f \in F} \lambda_f\, \delta^{f}_{v^q, v} & \text{otherwise,}
\end{cases}
\tag{3}
$$



where $\lambda_f \in [0, 1]$ provides a weight for each
feature $f$ used to compute the feature matching score
$\delta$. For each particular feature, the weight
$\lambda_f$ can be varied according to its robustness and
is therefore application dependent. Given two strings, $x$
(reference) and $y$ (primary), we compute feature matching
scores (for features such as string value, number of words
and size, cf. Eq. (1)) as follows.

• String type:

$\delta^{\text{string}}_{x,y} = 1 - \text{Levenshtein dist.}(x, y) / \max(|x|, |y|)$,
where we treat numerals {0-9}, all alphabets {A-Z, a-z}
and symbols equally.

• Number of words in a string:

$\delta^{\text{word}}_{x,y} = 1 - \text{dist.}^{\text{word}}(x, y) / \max(x, y)$,
i.e., the absolute difference in the number of words is
normalised by the maximum number of words.

• String size:

$\delta^{\text{length}}_{x,y} = 1 - \text{dist.}^{\text{length}}(x, y) / \max(x, y)$,
i.e., the absolute difference in size (number of letters)
is normalised by the maximum size.
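
The three matching scores above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, two empty strings are assumed to match perfectly, and `feature_score` assumes that Eq. (3) is read as "1 when the labels agree, otherwise the weighted mean of the individual feature scores", with all weights defaulting to 1.

```python
def levenshtein(x, y):
    """Edit distance with unit costs, treating all characters equally."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cx != cy)))  # substitution
        prev = curr
    return prev[-1]


def ratio_score(distance, largest):
    """1 - distance / largest, with two empty inputs counted as identical."""
    return 1.0 if largest == 0 else 1.0 - distance / largest


def string_score(x, y):
    return ratio_score(levenshtein(x, y), max(len(x), len(y)))


def word_count_score(x, y):
    nx, ny = len(x.split()), len(y.split())
    return ratio_score(abs(nx - ny), max(nx, ny))


def length_score(x, y):
    return ratio_score(abs(len(x) - len(y)), max(len(x), len(y)))


def feature_score(label_q, x, label_v, y, weights=None):
    """Assumed reading of Eq. (3): 1 on a label match, else a weighted mean."""
    if label_q == label_v:
        return 1.0
    scores = {
        "value": string_score(x, y),
        "noW": word_count_score(x, y),
        "size": length_score(x, y),
    }
    weights = weights or {}
    return sum(weights.get(f, 1.0) * s for f, s in scores.items()) / len(scores)
```

For example, `string_score("abcd", "abcf")` gives 0.75, since one substitution is needed over a maximum length of four. Lowering the weight of a feature that is fragile under OCR noise reduces its influence on the mean.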

Following Fig. 3, let us elaborate the concept of
matching. To simplify the explanation, let us first create
a relation vector space from the pattern graph and then
carry out the assignment process for each pivotal node in
a document. Taking a single pivotal node $v_1$ from a data
graph $G$ (having an identical label with respect to
$v^q_1$ in $Q$, i.e., $\ell^q_1 = \ell_1 \in \mathcal{L}$),
the idea is to assign relations
$\{r^q_{12}, r^q_{13}, r^q_{14}\}$ in data graph $G$. We
validate relations $\{r_{12}, r_{13}\}$ one by one and
compute the feature score in parallel. This provides
$G \subseteq Q$. However, the addition of a node $v_3$ can
help to make them exactly similar in configuration via an
edit cost operation.

Graph matching score computation. An aggregation of both
scores, i.e., r.score from relation assignment and f.score
from feature computation between the nodes, yields a
matching score $S$ for data graph $G$ with respect to $Q$:

$$
S(Q, G) = \alpha\, \frac{1}{|E^q|} \sum_{(i,j) \in E^q} r.\text{score}(r^q_{ij}, r_{ij})
+ (1 - \alpha)\, \frac{1}{|V^q|} \sum_{i \in V^q} f.\text{score}(v^q_i, v_i),
\quad \alpha \in [0, 1]. \tag{4}
$$
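
The aggregation in Eq. (4) is a convex combination of a relation term and a feature term. The sketch below assumes that each term is averaged over the pattern graph, so the caller supplies one r.score per pattern edge and one f.score per pattern node, with 0 for anything left unmatched; the function name and this normalisation are assumptions.

```python
def matching_score(r_scores, f_scores, alpha=0.5):
    """Combine relation and feature scores into S(Q, G).

    r_scores: one relation score per edge of the pattern graph Q.
    f_scores: one feature score per node of the pattern graph Q.
    alpha:    weight of the relation term, in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    r_term = sum(r_scores) / len(r_scores) if r_scores else 0.0
    f_term = sum(f_scores) / len(f_scores) if f_scores else 0.0
    return alpha * r_term + (1.0 - alpha) * f_term


# Pattern with three edges and four nodes; one edge and one node unmatched.
S = matching_score([1, 1, 0], [1.0, 0.9, 0.8, 0.0], alpha=0.5)
```

Setting `alpha` close to 1 favours structural agreement, while a value close to 0 favours the content of the fields.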



Confidence score computation. From each input pattern, a
set of mined graphs $\{(G_g, S_g)\}$ represents a table,
i.e., an output. For such an output, we compute the
corresponding confidence score (CS). CS is computed from
the aggregation of all matching scores $\{S_g\}$, which is
then normalised. In the case of multiple input patterns,
the outputs are ranked and provided on a one-to-one basis.
Ranking is based on the order of similarity.

Figure 3. Relation vector space to simplify relation
assignment. The illustration shows two different graphs,
Q and G, the corresponding adjacency matrices and the
relation vector spaces for a single pivotal node v1.
Panels: (a) data graph G; (b) adjacency matrix;
(c) relation vector space using v1 as a pivotal node.



Note that we aim to use the set of mined graphs to
iteratively update the pattern graph and transform it into
a graph model so that it can be used in the absence of the
clients, which is beyond the scope of this paper. A proof
of concept is reported in [9], and a thorough extension
(aiming at document information content extraction, where
the content is not necessarily found in structured
documents like forms) has been made in [10].

3 Experiments 

3.1 Dataset and evaluation metric 

Dataset. We work on a real-world industrial problem in
direct collaboration with ITESOFT, France. Currently, the
dataset is composed of 15 classes with 100 samples per
class. For each document, clients provide ground-truths,
i.e., all similar patterns within the table, according to
the pattern selected.

Evaluation metric. An output, i.e., the detected table, is
represented by a collection of mined graphs
$O = \{(G_g, S_g)\}$ in a test document, and there is a
list of $G^\circ$ ground-truthed patterns corresponding to
the ground-truthed table
$O^\circ = \{G^\circ_g\}_{g=1}^{G^\circ}$. Each graph $G$
has a number of fields that are simply represented by
iconic boxes $\{B_i\}$.

To evaluate, we extend the area-ratio-based measure
proposed by Shafait and Smith [11]. It uses bounding boxes
to describe detected tables and the ground-truths. In our
framework, the overlapping ratio between two boxes is
defined as

$$
OR_1(B^\circ_i, B_i) = \frac{2\,|B^\circ_i \cap B_i|}{|B^\circ_i| + |B_i|},
$$

where $|B^\circ_i \cap B_i|$ is the intersected or common
area of two bounding boxes from the ground-truthed and
detected tables, respectively, and $|B^\circ_i|$, $|B_i|$
are the individual areas. Note that
$OR_1(\cdot,\cdot) \in [0, 1]$. We sum up all
$OR_1(\cdot,\cdot)$ and normalise to compute the overall
overlapping ratio between ground-truth pattern $G^\circ$
and detected pattern $G$ by

$$
OR_2(G^\circ, G) = \frac{1}{\max(|B^\circ|, |B|)} \sum OR_1(b^\circ, b),
\quad \{b^\circ : b^\circ \in B^\circ \wedge b \in B\}.
$$

Then, for a whole table, we can express the evaluation
metric as

$$
Eval(O^\circ, O) = \frac{1}{\max(G^\circ, G)} \sum_{g} OR_2(G^\circ_g, G_g). \tag{5}
$$
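
The evaluation can be sketched as three nested averages over bounding boxes. The code below is a hedged illustration: it uses the area ratio of Shafait and Smith, 2|A ∩ B| / (|A| + |B|), for a pair of boxes, assumes that ground-truth and detected items are already paired by position in their lists, and normalises by the larger count so that missing or spurious items lower the score. The function names are hypothetical.

```python
def area(box):
    """Area of a (left, top, right, bottom) box; degenerate boxes count as 0."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def intersection_area(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def or1(gt_box, det_box):
    """Overlapping ratio between two boxes, in [0, 1]."""
    total = area(gt_box) + area(det_box)
    return 0.0 if total == 0 else 2 * intersection_area(gt_box, det_box) / total


def or2(gt_boxes, det_boxes):
    """Overlapping ratio between two patterns (lists of field boxes)."""
    n = max(len(gt_boxes), len(det_boxes))
    if n == 0:
        return 0.0
    return sum(or1(g, d) for g, d in zip(gt_boxes, det_boxes)) / n


def evaluate(gt_patterns, det_patterns):
    """Score for a whole table (lists of patterns)."""
    n = max(len(gt_patterns), len(det_patterns))
    if n == 0:
        return 0.0
    return sum(or2(g, d) for g, d in zip(gt_patterns, det_patterns)) / n


a = (0, 0, 10, 10)
b = (5, 0, 15, 10)
```

With these boxes, `or1(a, b)` is 0.5 because half of each box is shared. Pairing by list position is the simplest choice; a real evaluation would first need to associate each detected pattern with its ground-truth counterpart.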



3.2 Results and analysis 

We have validated the outputs over 15 different suppliers
by taking the associated ground-truths, and report the
average performance in Table 1. More specifically, the
table provides two different evaluations:

1. one is associated with input patterns created in the
laboratory, and

2. the other is directly related to client, or real-world,
patterns.

The first evaluation aims to validate the overall concept
as applied to content extraction associated with the
table. The second shows how robust it is. In the results
reported in Table 1, we observe the following.

1. Unsurprisingly, the cleaner the input pattern, the
better the performance. This is the case in eval. 1, since
the input patterns are created in accordance with the OCR
results.

2. In contrast, in the case of client input patterns
(eval. 2), a single field selection may sometimes take
word(s) from other nearby fields (to the left or right)
and from multiple lines. In such a selected box (from
clients), since OCR reads some dots (due to noise) as
'full-stop', 'colon' and 'semi-colon', cleaning is not
possible. As a consequence, the feature properties
representing the graph nodes can vary. Fig. 6 shows an
example of it.

Besides, another considerable issue is the complexity of
the graph-based pattern representation. In the case of
input patterns with complex structural formats (let us say
zig-zag), the integration of non-selected fields makes the
pattern graph more complex. Furthermore, as said before,
our system performance is affected by OCR errors since the
system does not provide the
Table 1. Average performance (in %) over three 
different types of table: header, body and footer. 

Table type   Header   Body   Footer   Avg.
Eval 2   97   96

Eval 1: input patterns created in lab.
Eval 2: input patterns from clients.
Execution time: about 2 sec. per document image.

Figure 4. Examples showing content extraction within the table in accordance with the input pattern (from
client). The tables contain (a) seven and (b) three similar patterns, from two different suppliers.



expected semantic labels at nodes in the graph. An example
of the OCR effect is 'false detection' because of the
structural similarity between the graphs.

4 Conclusions and future perspectives 

In this paper, we have presented a client-driven,
pattern-based approach to table extraction via a graph
mining scheme, inspired by real-world applications. We
have focused on, and validated, the idea that table
extraction does not only mean detecting the presence or
absence of tables and spotting the area where they are
located, but also selecting important key fields within
them while rejecting others.

Given an input pattern (i.e., a pattern graph), finding
similar pattern graphs so that we can reinforce or update
it iteratively each time we extract them is one of the
primary issues of further work [9, 10], for instance. As a
consequence, such models can be used to exploit document
information content in the absence of clients.